Latent MOE & Repeated MTP support for NemotronH; fix KV cache quant export #830
base: main
Conversation
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Important: Review skipped. Auto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI.
📝 Walkthrough

This PR extends Megatron Core model export and quantization with support for grouped MLP merging, self-attention scaling, and MTP (multi-token prediction) training. It refactors the importer into modular helper methods, improves KV cache scaling-factor handling, adds HuggingFace quantization config support, enhances MoE expert synchronization, and integrates TELinear quantization.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
modelopt/torch/export/quant_utils.py (1)
387-422: KV cache dtype now treats all FP4 as affine.
Line 396 computes `is_affine`, but `_compute_kv_cache_dtype` ignores it and defaults to `True`, so `(2, 1)` always maps to `KV_CACHE_NVFP4_AFFINE` even when bias is missing. This changes export semantics.

🛠️ Suggested fix

```diff
-    return _compute_kv_cache_dtype(num_bits_list)
+    return _compute_kv_cache_dtype(num_bits_list, is_affine)

-def _compute_kv_cache_dtype(num_bits_list: list[int]) -> str | None:
+def _compute_kv_cache_dtype(num_bits_list: list[int], is_affine: bool) -> str | None:
     """Returns the kv_cache dtype.

     If num_bits of output_quantizer is (4, 3) then returns FP8;
     if it is 8, returns int8, otherwise returns None.

     Args:
-        modules: The module or list of modules to inspect.
+        num_bits_list: List of num_bits from KV quantizers.
+        is_affine: Whether KV quantizers have bias/affine params.

     Returns:
         The kv_cache dtype.
     """
-    is_affine = True
-
     if (4, 3) in num_bits_list:
         return KV_CACHE_FP8
```

modelopt/torch/export/unified_export_megatron.py (1)
745-803: `exclude_modules` never populates due to the `qformat is None` check.

`get_quantization_format` returns `QUANTIZATION_NONE`, not `None`, so Line 763 never triggers and the exclude list stays empty.

🛠️ Suggested fix

```diff
-            if qformat is None and "norm" not in prefix:  # Add exclude layers for hf_quant_config
+            if qformat == QUANTIZATION_NONE and "norm" not in prefix:  # Add exclude layers
                 self.exclude_modules.append(prefix)
```
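To make the failure mode concrete, here is a tiny standalone illustration; the sentinel's actual value and the module prefix are assumptions, and only the check itself mirrors the comment above:

```python
QUANTIZATION_NONE = "none"  # assumed sentinel value; the real constant lives in quant_utils

exclude_modules: list[str] = []
prefix = "model.layers.0.mlp.router"  # hypothetical module prefix
qformat = QUANTIZATION_NONE           # an unquantized layer reports the sentinel, not None

if qformat is None and "norm" not in prefix:               # original check: never fires
    exclude_modules.append(prefix)
if qformat == QUANTIZATION_NONE and "norm" not in prefix:  # fixed check: fires as intended
    exclude_modules.append(prefix)

print(exclude_modules)  # ['model.layers.0.mlp.router']
```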
🤖 Fix all issues with AI agents
In `@modelopt/torch/export/plugins/mcore_nemotron.py`:
- Line 103: The inline comment "# TODO ADD MTP export" is stale because the
export mapping in this module now contains MTP entries; remove the TODO or
replace it with a precise note about any remaining work (e.g., which specific
MTP formats or tests are still missing) so readers aren’t misled—update the
comment near the export mapping and any related MTP export code references to
reflect current status.
In `@modelopt/torch/export/plugins/megatron_importer.py`:
- Around line 264-290: The grouped MLP loader (_grouped_mlp_merging) currently
skips any per-expert quantization scales (TODO weight_scale), so when importing
quantized grouped MLPs you must detect and copy the scale tensors into the
module state_dict before calling module.load_state_dict. Modify
_grouped_mlp_merging to (1) inspect the existing state_dict keys for the
quantizer scale key pattern used by TEGroupedMLP (e.g.,
"weight_quantizer._scale" or similar) and (2) for each expert_id call
self._get_safetensor with the matching safetensor key for that expert's scale
(e.g., use the same prefix.format(expert_id) + the scale suffix used in the
checkpoint) and assign that tensor into state_dict under the exact key name
expected by the module (mirroring how you already set
state_dict[f"weight{expert_id}"] = tensor for weights); finally preserve
dtype/shape checks and then call module.load_state_dict(state_dict) as before.
- Around line 693-774: The non-repeated MTP branch never increments or
initializes layer_id so every mtp in the for mtp in model.mtp.layers loop reuses
the same prefix; fix by initializing layer_id (e.g., layer_id = 0) before the
loop and incrementing layer_id at the end of each iteration of the for mtp in
model.mtp.layers loop so subsequent calls to self.rules["mtp.*"](...) use unique
layer indices.
In `@modelopt/torch/export/quant_utils.py`:
- Around line 348-367: The function get_kv_cache_scaling_factor currently
returns a list of two scalars but is annotated as torch.Tensor; update its type
hint to reflect the actual return type (for example List[Optional[torch.Tensor]]
or Tuple[Optional[torch.Tensor], Optional[torch.Tensor]]) and add necessary
typing imports (List, Optional, Tuple) if not present, and adjust the docstring
return description accordingly; keep the implementation that calls
get_scaling_factor on attributes k_bmm_quantizer and v_bmm_quantizer unchanged.
In `@modelopt/torch/export/unified_export_megatron.py`:
- Around line 1248-1255: The type annotation for moe_router_dtype is
inconsistent: export_mcore_gpt_to_hf declares it as str | None but
GPTModelExporter.__init__ and import_mcore_gpt_from_hf use torch.dtype | None;
change GPTModelExporter.__init__ signature to use str | None to match
export_mcore_gpt_to_hf and the implementation that expects string values like
"fp32"/"fp64", update import_mcore_gpt_from_hf to also use str | None for
moe_router_dtype, and add a brief docstring entry for moe_router_dtype in
export_mcore_gpt_to_hf describing allowed string values and meaning.
🧹 Nitpick comments (3)
modelopt/torch/quantization/plugins/megatron.py (1)
49-58: Make TELinear import optional to handle older Megatron-Core versions gracefully.

TELinear is available in Megatron-Core v0.9.0+ (Oct 2024), but the current import structure causes the entire TE try-except block to fail if any module is missing, disabling all TE quantization support. While TELinear is now stable, making it optional with a nested try-except is defensive programming that preserves TE support for environments using older Megatron-Core versions. Similarly, guard the `_QuantTEMCoreLinear` registration with an additional check for TELinear availability.

Suggested change

```diff
 try:
     from megatron.core.extensions.transformer_engine import (
         TEColumnParallelGroupedLinear,
         TEColumnParallelLinear,
         TEDotProductAttention,
         TELayerNormColumnParallelLinear,
-        TELinear,
         TERowParallelGroupedLinear,
         TERowParallelLinear,
     )
+    try:
+        from megatron.core.extensions.transformer_engine import TELinear
+    except ImportError:
+        TELinear = None

     from .transformer_engine import _QuantTEGroupedLinear, _QuantTELayerNormLinear, _QuantTELinear

     HAS_TE = True
 except ImportError:
     HAS_TE = False
+    TELinear = None

-if HAS_TE:
+if HAS_TE and TELinear is not None:

     @QuantModuleRegistry.register({TELinear: "te_mcore_Linear"})
     class _QuantTEMCoreLinear(_QuantTELinear):
         pass
```

modelopt/torch/export/unified_export_megatron.py (2)
113-158: Handle `moe_router_dtype` inputs consistently and avoid unconditional prints.

Line 113 still types `torch.dtype`, but the implementation only handles string inputs; callers passing `torch.float32` now silently get `None`. Also, Line 129 prints unconditionally. Consider accepting both types and gating logs.

♻️ Suggested refactor

```diff
-        moe_router_dtype: torch.dtype | None = None,
+        moe_router_dtype: str | torch.dtype | None = None,
@@
-        self.moe_router_dtype = None
-        if moe_router_dtype == "fp32":
+        self.moe_router_dtype = None
+        if isinstance(moe_router_dtype, torch.dtype):
+            self.moe_router_dtype = moe_router_dtype
+        elif moe_router_dtype == "fp32":
             self.moe_router_dtype = torch.float32
         elif moe_router_dtype == "fp64":
             self.moe_router_dtype = torch.float64
-        print(f"Exporting model with moe_router_dtype: {self.moe_router_dtype}")
+        if torch.distributed.is_initialized():
+            if torch.distributed.get_rank() == 0:
+                print(f"Exporting model with moe_router_dtype: {self.moe_router_dtype}")
+        else:
+            print(f"Exporting model with moe_router_dtype: {self.moe_router_dtype}")
```
835-870: Gate debug prints behind a logger/flag.
Lines 838–870 print per‑tensor dtype for every module; this is noisy and can slow large exports.
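One possible shape for that gating, as a small hypothetical sketch (the helper and its arguments are placeholders, not the exporter's actual code):

```python
import logging

import torch

logger = logging.getLogger(__name__)


def log_tensor_dtype(name: str, tensor: torch.Tensor) -> None:
    # Emit per-tensor dtype info only when debug logging is enabled,
    # instead of printing unconditionally for every exported module.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("exporting %s with dtype %s", name, tensor.dtype)
```

Enabling DEBUG for this logger would restore the per-tensor output on demand without slowing default exports.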
```python
    def _grouped_mlp_merging(
        self,
        module,
        prefix,
        parallel_config: ParallelConfig | None = None,
        is_mtp: bool = False,
        init_expert_id: int = 0,
        num_local_experts: int = 1,
    ):
        if is_mtp:
            if "backbone" in prefix:
                prefix = prefix.replace("backbone", "mtp")
            else:
                prefix = prefix.replace("model", "mtp")

        state_dict = module.state_dict()
        # TODO handle weight_scale
        # weight_scale = state_dict.get("weight_quantizer._scale", None)

        assert module.num_gemms == num_local_experts, (
            "num_gemms must be equal to num_local_experts in TEGroupedMLP"
        )
        for expert_id in range(init_expert_id, init_expert_id + num_local_experts):
            tensor = self._get_safetensor(prefix.format(expert_id) + ".weight")
            state_dict[f"weight{expert_id}"] = tensor

        module.load_state_dict(state_dict)
```
Grouped MLP import skips weight scales.
Line 280 notes a TODO for weight_scale handling; if grouped MLPs can be quantized, this will leave scales stale.
🤖 Prompt for AI Agents
In `@modelopt/torch/export/plugins/megatron_importer.py` around lines 264 - 290,
The grouped MLP loader (_grouped_mlp_merging) currently skips any per-expert
quantization scales (TODO weight_scale), so when importing quantized grouped
MLPs you must detect and copy the scale tensors into the module state_dict
before calling module.load_state_dict. Modify _grouped_mlp_merging to (1)
inspect the existing state_dict keys for the quantizer scale key pattern used by
TEGroupedMLP (e.g., "weight_quantizer._scale" or similar) and (2) for each
expert_id call self._get_safetensor with the matching safetensor key for that
expert's scale (e.g., use the same prefix.format(expert_id) + the scale suffix
used in the checkpoint) and assign that tensor into state_dict under the exact
key name expected by the module (mirroring how you already set
state_dict[f"weight{expert_id}"] = tensor for weights); finally preserve
dtype/shape checks and then call module.load_state_dict(state_dict) as before.
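A sketch of how that could look as a helper on the importer, under the assumptions spelled out in the prompt; the scale key patterns and the checkpoint suffix below are hypothetical and must be replaced by whatever TEGroupedMLP and the checkpoint actually use:

```python
def _load_grouped_mlp_expert_scales(self, module, prefix, init_expert_id, num_local_experts):
    """Hypothetical helper: copy per-expert weight scales alongside the expert weights."""
    state_dict = module.state_dict()
    # Only attempt this when the module is quantized and exposes scale entries.
    if not any(key.endswith("weight_quantizer._scale") for key in state_dict):
        return state_dict

    for expert_id in range(init_expert_id, init_expert_id + num_local_experts):
        # Checkpoint key layout is an assumption; mirror the suffix the checkpoint really uses.
        scale = self._get_safetensor(prefix.format(expert_id) + ".weight_scale")
        # The destination key must be whatever TEGroupedMLP expects for expert `expert_id`.
        state_dict[f"weight_quantizer.{expert_id}._scale"] = scale
    return state_dict
```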
```python
        # MTP
        if hasattr(model, "mtp"):
            print("Importing MTP", flush=True)
            # MTP is the last layer in DeepSeek V3/R1
            layer_id += 1
            if len(model.mtp.layers) == 1:  # Repeated MTP
                layer_id = 0  # reset layer_id for repeated MTP
                mtp = model.mtp.layers[0]
                mtp_model_layers = mtp.mtp_model_layer.layers
                for mtp_model_layer in mtp_model_layers:
                    if isinstance(mtp_model_layer, MambaLayer):
                        self._import_mamba_layer(mtp_model_layer, layer_id, layer_pbar)
                    elif isinstance(mtp_model_layer, TransformerLayer):
                        self._import_transformer_layer(
                            mtp_model_layer, layer_id, layer_pbar, is_mtp=True
                        )
                    else:
                        raise ValueError(
                            f"Unsupported layer type during MTP import: {type(mtp_model_layer)}"
                        )
                    layer_id += 1
            else:  # non-repeated MTP
                for mtp in model.mtp.layers:
                    self.rules["mtp.eh_proj"](mtp.eh_proj, layer_id)
                    self.rules["mtp.enorm"](mtp.enorm, layer_id)
                    self.rules["mtp.hnorm"](mtp.hnorm, layer_id)
                    self.rules["mtp.input_layernorm"](
                        mtp.decoder.layers[0].input_layernorm, layer_id
                    )
                    if hasattr(mtp.decoder.layers[0].self_attention, "linear_q_proj"):
                        self.rules["mtp.linear_q_proj"](
                            mtp.decoder.layers[0].self_attention.linear_q_proj, layer_id
                        )
                    else:
                        self.rules["mtp.linear_q_down_proj"](
                            mtp.decoder.layers[0].self_attention.linear_q_down_proj, layer_id
                        )
                        self.rules["mtp.linear_q_layernorm"](
                            mtp.decoder.layers[0].self_attention.q_layernorm, layer_id
                        )
                        self.rules["mtp.linear_q_up_proj"](
                            mtp.decoder.layers[0].self_attention.linear_q_up_proj, layer_id
                        )
                    self.rules["mtp.linear_kv_down_proj"](
                        mtp.decoder.layers[0].self_attention.linear_kv_down_proj, layer_id
                    )
                    self.rules["mtp.linear_kv_layernorm"](
                        mtp.decoder.layers[0].self_attention.kv_layernorm, layer_id
                    )
                    self.rules["mtp.linear_kv_up_proj"](
                        mtp.decoder.layers[0].self_attention.linear_kv_up_proj, layer_id
                    )
                    self.rules["mtp.linear_proj"](
                        mtp.decoder.layers[0].self_attention.linear_proj, layer_id
                    )
                    self.rules["mtp.pre_mlp_layernorm"](
                        mtp.decoder.layers[0].pre_mlp_layernorm, layer_id
                    )
                    self.rules["mtp.router"](mtp.decoder.layers[0].mlp.router, layer_id)
                    self.rules["mtp.shared_experts.linear_fc1"](
                        mtp.decoder.layers[0].mlp.shared_experts.linear_fc1, layer_id
                    )
                    self.rules["mtp.shared_experts.linear_fc2"](
                        mtp.decoder.layers[0].mlp.shared_experts.linear_fc2, layer_id
                    )
                    for expert_id, expert in tqdm(
                        enumerate(mtp.decoder.layers[0].mlp.experts.local_experts),
                        desc="Importing MoE local experts",
                        leave=False,
                        disable=self.disable_tqdm,
                    ):
                        self.rules["mtp.local_experts.linear_fc1"](
                            expert.linear_fc1, layer_id, expert_id
                        )
                        self.rules["mtp.local_experts.linear_fc2"](
                            expert.linear_fc2, layer_id, expert_id
                        )
```
Non‑repeated MTP import reuses the same layer_id for every layer.
layer_id is never updated in the non‑repeated branch, so each iteration overwrites the same prefix.
🛠️ Suggested fix

```diff
-            else:  # non-repeated MTP
-                for mtp in model.mtp.layers:
+            else:  # non-repeated MTP
+                for mtp_id, mtp in enumerate(model.mtp.layers):
+                    layer_id = mtp_id
                     self.rules["mtp.eh_proj"](mtp.eh_proj, layer_id)
                     self.rules["mtp.enorm"](mtp.enorm, layer_id)
                     self.rules["mtp.hnorm"](mtp.hnorm, layer_id)
```

🤖 Prompt for AI Agents
In `@modelopt/torch/export/plugins/megatron_importer.py` around lines 693 - 774,
The non-repeated MTP branch never increments or initializes layer_id so every
mtp in the for mtp in model.mtp.layers loop reuses the same prefix; fix by
initializing layer_id (e.g., layer_id = 0) before the loop and incrementing
layer_id at the end of each iteration of the for mtp in model.mtp.layers loop so
subsequent calls to self.rules["mtp.*"](...) use unique layer indices.
```diff
 def export_mcore_gpt_to_hf(
     model: torch.nn.Module,
     pretrained_model_name_or_path: str | os.PathLike | None = None,
     export_extra_modules: bool = False,
     dtype: torch.dtype = torch.bfloat16,
     export_dir: Path | str = tempfile.gettempdir(),
-    moe_router_dtype: torch.dtype | None = None,
+    moe_router_dtype: str | None = None,
 ):
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Verify call sites and parameter usage for the signature change.
rg -n --type=py 'export_mcore_gpt_to_hf\('
rg -n --type=py 'GPTModelExporter\('
```

Repository: NVIDIA/Model-Optimizer
Length of output: 528
🏁 Script executed:

```bash
#!/bin/bash
# Examine the test call site
echo "=== Test call site (line 94) ==="
sed -n '85,105p' tests/gpu/torch/export/test_unified_export_megatron.py

# Examine the function and GPTModelExporter instantiation
echo "=== Function definition and GPTModelExporter instantiation ==="
sed -n '1248,1280p' modelopt/torch/export/unified_export_megatron.py

# Check if there's a docstring with parameter documentation
echo "=== Full function with docstring ==="
sed -n '1248,1300p' modelopt/torch/export/unified_export_megatron.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 4205
🏁 Script executed:

```bash
#!/bin/bash
# Find GPTModelExporter class definition
echo "=== GPTModelExporter class definition ==="
rg -n 'class GPTModelExporter' --type=py

# Get the class definition and __init__ method
echo "=== GPTModelExporter __init__ ==="
rg -A 20 'class GPTModelExporter' --type=py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 2142
🏁 Script executed:

```bash
#!/bin/bash
# Get the full __init__ method of GPTModelExporter
sed -n '100,150p' modelopt/torch/export/unified_export_megatron.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 2574
Fix type annotation inconsistency for moe_router_dtype parameter.
The export_mcore_gpt_to_hf function signature correctly uses str | None, but GPTModelExporter.__init__ incorrectly declares it as torch.dtype | None. The implementation in GPTModelExporter (lines 127-130) expects string values like "fp32" and "fp64", not torch.dtype objects. Update GPTModelExporter.__init__ to use str | None to match the actual implementation and the public API.
Additionally, the moe_router_dtype parameter is missing from the docstring of export_mcore_gpt_to_hf, and the type signature for moe_router_dtype in the related import_mcore_gpt_from_hf function uses torch.dtype | None, creating inconsistency across the export/import interface.
🤖 Prompt for AI Agents
In `@modelopt/torch/export/unified_export_megatron.py` around lines 1248 - 1255,
The type annotation for moe_router_dtype is inconsistent: export_mcore_gpt_to_hf
declares it as str | None but GPTModelExporter.__init__ and
import_mcore_gpt_from_hf use torch.dtype | None; change
GPTModelExporter.__init__ signature to use str | None to match
export_mcore_gpt_to_hf and the implementation that expects string values like
"fp32"/"fp64", update import_mcore_gpt_from_hf to also use str | None for
moe_router_dtype, and add a brief docstring entry for moe_router_dtype in
export_mcore_gpt_to_hf describing allowed string values and meaning.
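For illustration, a small hypothetical sketch of the string-to-dtype mapping the exporter is described as implementing; the helper name is made up, and only the "fp32"/"fp64" values come from the review above:

```python
import torch

_MOE_ROUTER_DTYPES = {"fp32": torch.float32, "fp64": torch.float64}


def resolve_moe_router_dtype(moe_router_dtype: str | None) -> torch.dtype | None:
    # None (or an unrecognized string) leaves the router dtype unset.
    return _MOE_ROUTER_DTYPES.get(moe_router_dtype) if moe_router_dtype else None
```

With this shape, `resolve_moe_router_dtype("fp32")` returns `torch.float32` and `resolve_moe_router_dtype(None)` returns `None`, which is why a `str | None` annotation matches the behavior better than `torch.dtype | None`.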
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
a87e6a9 to e41783a (Compare)
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #830      +/-   ##
==========================================
- Coverage   73.82%   73.38%   -0.45%
==========================================
  Files         193      193
  Lines       19745    19894     +149
==========================================
+ Hits        14577    14599      +22
- Misses       5168     5295     +127
```

☔ View full report in Codecov by Sentry.
```python
        # Step 1: Sync amax across local experts in a SequentialMLP
        for name, module in model.named_modules():
            if hasattr(module, "sync_moe_local_experts_amax"):
                module.sync_moe_local_experts_amax()
```
This function says to sync across EP and ETP but I didn't see any collective communication happen in the function. The description needs to be changed I think.
I can change the name of the method. What it's actually doing is just looping through the local experts and assigning the same amax.
it is syncing local experts for sequential moe.
the name does not say or indicate it is collective comm.
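A rough sketch of what the discussion describes, i.e. equalizing amax across the local experts of a SequentialMLP with no collective communication; the attribute names and the use of the per-expert maximum are assumptions for illustration, not the repository's exact implementation:

```python
import torch


def sync_moe_local_experts_amax(local_experts) -> None:
    """Assign one shared amax to every local expert's weight quantizer (no collectives)."""
    amaxes = [
        expert.weight_quantizer.amax
        for expert in local_experts
        if expert.weight_quantizer.amax is not None
    ]
    if not amaxes:
        return
    # Take the elementwise max across experts so no expert's range is clipped.
    shared_amax = torch.stack([torch.as_tensor(a) for a in amaxes]).max(dim=0).values
    for expert in local_experts:
        expert.weight_quantizer.amax = shared_amax
```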
I have concerns about "Remove KV quantization scale clamping to 1" without accuracy data to support it. This also deviates from the HF PTQ code path. I think we need to have a discussion about the impact.
realAsma left a comment
LGTM! I did not look at export code.
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
What does this PR do?
Type of change: New feature and bug fix
Overview:
Support Latent MOE and Repeated MTP for NemotronH models
Fix `qkv_layer.output_quantizer` export & replace with proper `k/v_bmm_quantizer` logic

Usage
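A hypothetical usage sketch based on the `export_mcore_gpt_to_hf` signature shown in the review above; the model handle and path arguments are placeholders, not part of this PR:

```python
import torch

from modelopt.torch.export.unified_export_megatron import export_mcore_gpt_to_hf

# `model` is assumed to be an already-built (and optionally quantized) Megatron-Core
# GPT/NemotronH model; the paths below are placeholders.
export_mcore_gpt_to_hf(
    model,
    pretrained_model_name_or_path="path/to/hf/config",
    dtype=torch.bfloat16,
    export_dir="/tmp/hf_export",
    moe_router_dtype="fp32",  # string form, per the signature change in this PR
)
```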
```python
# Add a code snippet demonstrating how to use this
```

Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Refactor